CIEMPIESS: A New Open-Sourced Mexican Spanish Radio Corpus
Abstract
This paper presents the development of the “Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio Social” (CIEMPIESS), a new open-sourced corpus extracted from Spanish-language FM podcasts in the dialect of central Mexico. The CIEMPIESS corpus was designed for use in automatic speech recognition (ASR) and is provided with two different kinds of pronouncing dictionaries: one contains the phonemes of Mexican Spanish, and the other contains the same phonemes plus allophones. Corpus annotation took into account the tonic vowel of every word and the four different sounds that the letter “x” represents in Spanish. The CIEMPIESS corpus is also provided with two different language models extracted from electronic newsletters; one takes the tonic vowels into account and the other does not. Together, the dictionaries and the language models allow users to experiment with different scenarios for the recognition task and to adapt the corpus to their needs.
Similar resources
DIMEx100: A New Phonetic and Speech Corpus for Mexican Spanish
In this paper the phonetic and speech corpus DIMEx100 for Mexican Spanish is presented. We discuss both the linguistic motivation and the computational tools employed for the design, collection and transcription of the corpus. The phonetic transcription methodology is based on recent empirical studies proposing a new basic set of allophones and phonological rules for the dialect of the central ...
Cultural Influence on the Expression of Cathartic Conceptualization in English and Spanish: A Corpus-Based Analysis
This paper investigates the conceptualization of emotional release from a cognitive linguistics perspective (Cognitive Metaphor Theory). The metaphor “weeping is a means of liberating contained emotions” is grounded in universal embodied cognition and is reflected in linguistic expressions in English and Spanish. Lexicalization patterns which encapsulate this conceptualization i...
Compilation of a Mexican Spanish text corpora
Collections of texts with syntactic annotation are nowadays useful resources. They are employed for diverse tasks in theoretical research and natural language applications. The most important collections are dedicated to English, but huge efforts have been made to develop corresponding collections for other languages. In this work we present the initial steps for the compilation of a Mexican Spani...
VOXMEX Speech Database: Design of a Phonetically Balanced Corpus
We present a method for designing a phonetically balanced speech corpus. In this method, we used a phonotactic approach to design the phonetic content of VOXMEX: a phonetically balanced corpus for Mexican Spanish. The transcriptions of VOXMEX contain a complete coverage of phonemes and allophones of Mexican Spanish in every possible context. This corpus is designed for doing phonetic research a...
Measures of speech rhythm and the role of corpus-based word frequency: a multifactorial comparison of Spanish(-English) speakers
In this study, we address various measures that have been employed to distinguish between syllable- and stress-timed languages. This study differs from all previous ones by (i) exploring and comparing multiple metrics within a quantitative and multifactorial perspective and by (ii) also documenting the impact of corpus-based word frequency. We begin with the basic distinctions of speech rhythms, ...
Publication date: 2014